Abstract

Image denoising based on a probabilistic model of local image patches has been 
employed by various researchers, and recently a deep (denoising) autoencoder has
been proposed by Burger et al. [2012] and Xie et al. [2012] as a good model for
this. In this paper, we propose that another popular family of models in the field of 
deep learning, called Boltzmann machines, can perform image denoising as well 
as, or in certain cases of high level of noise, better than denoising autoencoders. 
We empirically evaluate the two models on three different sets of images with 
different types and levels of noise. Throughout the experiments we also examine 
the effect of the depth of the models. The experiments confirmed our claim and 
revealed that the performance can be improved by adding more hidden layers, 
especially when the level of noise is high. 

1 Introduction 

Numerous approaches based on machine learning have been proposed for image denoising tasks 
over time. A dominant approach has been to perform denoising based on local statistics over a whole 
image. For instance, a method denoises each small image patch extracted from a whole noisy image 
and reconstructs the clean image from the denoised patches. Under this approach, it is possible to
use raw pixels of each image patch [see, e.g., Hyvärinen et al., 1999] or the representations in another
domain, for instance, in the wavelet domain [see, e.g., Portilla et al., 2003].

In the case of using raw pixels, sparse coding has been a method of choice. Hyvärinen et al. [1999]
proposed to use independent component analysis (ICA) to estimate a dictionary of sparse elements 
and compute the sparse code of image patches. Subsequently, a shrinkage nonlinear function is 
applied to the estimated sparse code elements to suppress those elements with small absolute
magnitude. These sparse code elements are used to reconstruct a noise-free image patch. More recently,
Elad and Aharon [2006] also showed that sparse overcomplete representation is useful in denoising
images. Some researchers claimed that better denoising performance can be achieved by using a
variant of sparse coding methods [see, e.g., Shang and Huang, 2005, Lu et al., 2011].

In essence, these approaches build a probabilistic model of natural image patches using a layer of 
sparse latent variables. The posterior distribution of each noisy patch is either exactly computed or 
estimated, and the noise-free patch is reconstructed as an expectation of a conditional distribution 
over the posterior distribution. 

Based on this interpretation, some researchers have proposed very recently to utilize a (probabilistic)
model that has more than one layer of latent variables for image denoising. Burger et al. [2012]
showed that a deep multi-layer perceptron that learns a mapping from a noisy image patch to its
corresponding clean version can perform as well as the state-of-the-art denoising methods.
Similarly, Xie et al. [2012] proposed a variant of a stacked denoising autoencoder [Vincent et al., 2010]
that is more effective in image denoising. They were also able to show that the denoising approach
based on deep neural networks performed as well as, or sometimes better than, the conventional
state-of-the-art methods.

Along this line of research, we aim to propose yet another type of deep neural network
for image denoising in this paper. Gaussian-Bernoulli restricted Boltzmann machines
(GRBM) [Hinton and Salakhutdinov, 2006] and deep Boltzmann machines (GDBM)
[Salakhutdinov and Hinton, 2009, Cho et al., 2011b] are empirically shown to perform well in
image denoising, compared to stacked denoising autoencoders. Furthermore, we extensively
evaluate the effect of the number of hidden layers of both Boltzmann machine-based deep models and
autoencoder-based ones. The empirical evaluation is conducted using different noise types and
levels on three different sets of images.



2 Deep Neural Networks 

We start by briefly describing Boltzmann machines and denoising autoencoders which have become 
increasingly popular in the field of machine learning. 

2.1 Boltzmann Machines 



Originally proposed in the 1980s, a Boltzmann machine (BM) [Ackley et al., 1985] and especially its
structurally constrained version, a restricted Boltzmann machine (RBM) [Smolensky, 1986], have
become increasingly important in machine learning since Hinton and Salakhutdinov [2006] showed
that a powerful deep neural network can be trained easily by stacking RBMs on top of each other.
More recently, another variant of a BM, called a deep Boltzmann machine (DBM), has been
proposed and shown to outperform other conventional machine learning methods in many tasks [see,
e.g., Salakhutdinov and Hinton, 2009].

We first describe a Gaussian-Bernoulli DBM (GDBM) that has L layers of binary hidden units and 
a single layer of Gaussian visible units. A GDBM is defined by its energy function 



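$$E(\mathbf{v}, \mathbf{h} \mid \theta) = \sum_{i=1}^{N_v} \frac{(v_i - b_i)^2}{2\sigma^2} - \sum_{i=1}^{N_v} \sum_{j=1}^{N_1} \frac{v_i}{\sigma^2} h_j^{(1)} w_{ij} - \sum_{l=1}^{L} \sum_{j=1}^{N_l} h_j^{(l)} c_j^{(l)} - \sum_{l=1}^{L-1} \sum_{j=1}^{N_l} \sum_{k=1}^{N_{l+1}} h_j^{(l)} h_k^{(l+1)} u_{jk}^{(l)}, \tag{1}$$

where $\mathbf{v} = [v_i]_{i=1 \ldots N_v}$ and $\mathbf{h}^{(l)} = [h_j^{(l)}]_{j=1 \ldots N_l}$ are $N_v$ Gaussian visible units and $N_l$ binary
hidden units in the l-th hidden layer. $\mathbf{W} = [w_{ij}]$ is the set of weights between the visible neurons
and the first layer hidden neurons, while $\mathbf{U}^{(l)} = [u_{jk}^{(l)}]$ is the set of weights between the l-th and
(l+1)-th hidden neurons. $\sigma^2$ is the shared variance of the conditional distribution of $v_i$ given the
hidden units.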

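With the energy function, a GDBM can assign a probability to each state vector
$\mathbf{x} = [\mathbf{v}; \mathbf{h}^{(1)}; \cdots; \mathbf{h}^{(L)}]$ using a Boltzmann distribution:

$$P(\mathbf{x} \mid \theta) = \frac{1}{Z(\theta)} \exp\{-E(\mathbf{x} \mid \theta)\}.$$

Based on this property the parameters can be learned by maximizing the log-likelihood
$\mathcal{L} = \sum_{n=1}^{N} \log \sum_{\mathbf{h}} P(\mathbf{v}^{(n)}, \mathbf{h} \mid \theta)$ given N training samples $\{\mathbf{v}^{(n)}\}_{n=1,\ldots,N}$, where
$\mathbf{h} = [\mathbf{h}^{(1)}; \cdots; \mathbf{h}^{(L)}]$.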

Although the update rules based on the gradients of the log-likelihood function are well
defined, it is intractable to exactly compute them. Hence, an approach that uses variational
approximation together with Markov chain Monte Carlo (MCMC) sampling was proposed by
Salakhutdinov and Hinton [2009].

It has, however, been found that training a GDBM using this approach starting from randomly
initialized parameters is not trivial [Salakhutdinov and Hinton, 2009, Desjardins et al., 2012, Cho et al.,
2012]. Hence, Salakhutdinov and Hinton [2009] and Cho et al. [2012] proposed pretraining
algorithms that can initialize the parameters of DBMs. In this paper, we use the pretraining algorithm
proposed by Cho et al. [2012].

A Gaussian-Bernoulli RBM (GRBM) is a special case of a GDBM, where the number of hidden
layers is restricted to one, L = 1. Due to this restriction it is possible to compute the posterior
distribution over the hidden units conditioned on the visible units exactly and tractably. The conditional
probability of each hidden unit $h_j = h_j^{(1)}$ is

$$p(h_j = 1 \mid \mathbf{v}, \theta) = f\left( \sum_{i} w_{ij} \frac{v_i}{\sigma^2} + c_j \right), \tag{2}$$

where $f(x) = \frac{1}{1 + \exp\{-x\}}$ is the logistic sigmoid function.

Hence, the positive part of the gradient, which needs to be approximated with a variational approach
in the case of GDBMs, can be computed exactly and efficiently. Only the negative part, which is
computed over the model distribution, still relies on MCMC sampling, or more approximate methods
such as contrastive divergence (CD) [Hinton, 2002].
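
As a rough illustration of the exact hidden posterior of a GRBM described above, the following NumPy sketch evaluates the conditional probability of all hidden units at once. The function name, the array shapes and the random parameters are assumptions made for this example only; they are not taken from the experiments.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def grbm_hidden_posterior(v, W, c, sigma2):
    """Return p(h_j = 1 | v) for every hidden unit.

    v: (N_v,) visible vector, W: (N_v, N_h) weights,
    c: (N_h,) hidden biases, sigma2: shared variance (scalar).
    """
    return sigmoid((v / sigma2) @ W + c)

# Hypothetical sizes: an 8 x 8 patch and 128 hidden units.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((64, 128))
c = np.zeros(128)
v = rng.standard_normal(64)
p_h = grbm_hidden_posterior(v, W, c, sigma2=1.0)
```

Because the hidden units are conditionally independent given the visible units, a single matrix-vector product is enough here; no sampling or iteration is involved.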



2.2 Denoising Autoencoders 

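A denoising autoencoder (DAE) is a special form of multi-layer perceptron network with 2L - 1
hidden layers and L - 1 sets of tied weights. A DAE tries to learn a network that reconstructs an
input vector optimally by minimizing the following cost function:

$$\sum_{n=1}^{N} \left\| g^{(1)} \circ \cdots \circ g^{(L-1)} \circ f^{(L-1)} \circ \cdots \circ f^{(1)} \left( \eta(\mathbf{v}^{(n)}) \right) - \mathbf{v}^{(n)} \right\|^2,$$

where

$$f^{(l)} = \phi\left( \mathbf{W}^{(l)\top} \mathbf{h}^{(l-1)} \right) \quad \text{and} \quad g^{(l)} = \phi\left( \mathbf{W}^{(l)} \mathbf{h}^{(2L-l)} \right) \tag{3}$$

are encoding and decoding functions for the l-th layer with a component-wise nonlinearity function $\phi$.
$\mathbf{W}^{(l)}$ is the set of weights between the l-th and (l+1)-th layers and is shared by the encoder and decoder.
For notational simplicity, we omit the biases to all units.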

Unlike an ordinary autoencoder, a DAE explicitly corrupts an input vector during learning via
$\eta(\cdot)$, which adds noise to the input vector. It is usual
to combine two different types of noise when using $\eta(\cdot)$, which are additive isotropic Gaussian noise
and masking noise [Vincent et al., 2010]. The first type adds zero-mean Gaussian noise to each
input component, while the masking noise sets a set of randomly chosen input components to zero.
Then, the DAE is trained to denoise the corrupted input.
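
The corruption process can be sketched as follows. This is only a minimal example: the function name `corrupt` is hypothetical, and the default noise standard deviation (0.1) and masking probability (20%) are the values reported in Appendix A.2.

```python
import numpy as np

def corrupt(v, rng, noise_std=0.1, mask_prob=0.2):
    """Apply additive Gaussian noise followed by masking noise.

    v: array of input components, rng: numpy random Generator.
    """
    # Additive isotropic Gaussian noise with zero mean.
    noisy = v + noise_std * rng.standard_normal(v.shape)
    # Masking noise: each component is set to zero with probability mask_prob.
    keep = rng.random(v.shape) >= mask_prob
    return noisy * keep

rng = np.random.default_rng(0)
v = rng.standard_normal(64)
v_tilde = corrupt(v, rng)
```

During training, a fresh corruption would be drawn for every sample at every update, while the reconstruction target remains the clean vector.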

Training a DAE is straightforward using the backpropagation algorithm, which computes the gradient of
the objective function using the chain rule and dynamic programming. Vincent et al. [2010] proposed
that training a deep DAE becomes easier when the weights of a deep DAE are initialized by greedily
pretraining each layer of a deep DAE as if it were a single-layer DAE. In the following experiments
section, we follow this approach to initialize the weights and subsequently finetune the network with
stochastic backpropagation.



3 Image Denoising 

There are a number of ways to perform image denoising. In this paper, we are interested in an 
approach that relies on local statistics of an image. 

As mentioned earlier, a noisy large image can be denoised by denoising small patches
of the image and combining them together. Let us define a set of N binary matrices $\mathbf{D}_n \in \mathbb{R}^{p \times d}$
that extract a set of small image patches given a large, whole image $\mathbf{x} \in \mathbb{R}^{d}$, where d = whc is the
product of the width w, the height h and the number of color channels c of the image and p is the size
of image patches (e.g., p = 64 for an 8 x 8 image patch). Then, the denoised image is constructed by

$$\hat{\mathbf{x}} = \left( \sum_{n=1}^{N} \mathbf{D}_n^\top r_\theta(\mathbf{D}_n \mathbf{x}) \right) \oslash \left( \sum_{n=1}^{N} \mathbf{D}_n^\top \mathbf{D}_n \mathbf{1} \right), \tag{4}$$

where $\oslash$ is an element-wise division and $\mathbf{1}$ is a vector of ones. $r_\theta(\cdot)$ is an image denoising function,
parameterized by $\theta$, that denoises N image patches extracted from the input image $\mathbf{x}$.

Eq. (4) essentially extracts and denoises all possible image patches from the input image. Then, it
combines them by taking an average of those overlapping pixels.
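
The patch-wise reconstruction can be sketched without forming the extraction matrices explicitly. In the example below, slicing plays the role of the extraction matrices, an accumulator holds the sum of the denoised patches and a counter holds the number of patches covering each pixel. The function name, the grayscale-only input and the `stride` argument are assumptions of this sketch.

```python
import numpy as np

def denoise_image(x, denoise_patch, patch=8, stride=1):
    """Denoise a 2-D grayscale image by averaging overlapping denoised patches.

    x: (H, W) array, denoise_patch: function mapping a flat patch of
    length patch*patch to a denoised flat patch of the same length.
    """
    H, W = x.shape
    acc = np.zeros_like(x, dtype=float)   # sum of denoised patches per pixel
    cnt = np.zeros_like(x, dtype=float)   # number of patches covering each pixel
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            p = x[i:i + patch, j:j + patch].reshape(-1)
            q = np.asarray(denoise_patch(p)).reshape(patch, patch)
            acc[i:i + patch, j:j + patch] += q
            cnt[i:i + patch, j:j + patch] += 1.0
    # Pixels never covered (possible when stride > 1) are left at zero.
    return acc / np.maximum(cnt, 1.0)

# With an identity "denoiser", the image is recovered exactly.
x = np.arange(100.0).reshape(10, 10)
x_hat = denoise_image(x, lambda p: p, patch=4, stride=1)
```

A stride larger than one reduces the number of patches to denoise, at the price of averaging over fewer overlapping estimates per pixel.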

There is some flexibility in constructing a matrix D. The most obvious choice is the size of an
image patch. Although there is no standard approach, many previous attempts tend to use patch
sizes between 4 x 4 and 17 x 17. Another choice, called a stride, is the number of pixels between two
consecutive patches. Taking every possible patch is one option, while one may opt to overlap
patches by only a few pixels, which would reduce the computational complexity.

One of the popular choices for $r_\theta(\cdot)$ has been to construct a probabilistic model with a set of
latent variables that describe natural image patches. For instance, sparse coding, which is, in
essence, a probabilistic model with a single layer of latent variables, has been a common choice.
Hyvärinen et al. [1999] used ICA and a nonlinear shrinkage function to compute the sparse code of
an image patch, while Elad and Aharon [2006] used K-SVD to build a sparse code dictionary.

Under this approach denoising can be considered as a two-step reconstruction. Initially, the posterior 
distribution over the latent variables is computed, or estimated, given an image patch. Given the 
estimated posterior distribution, the conditional distribution, or its mean, over the visible units is 
computed and used as a denoised image patch. 

In the following subsections, we describe how $r_\theta(\cdot)$ can be implemented in the cases of Boltzmann
machines and denoising autoencoders.



3.1 Boltzmann machines 



We consider a BM with a set of Gaussian visible units v that correspond to the pixels of an image 
patch and a set of binary hidden units h. Then, the goal of denoising can be written as 

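$$p(\mathbf{v} \mid \tilde{\mathbf{v}}) = \sum_{\mathbf{h}} p(\mathbf{v} \mid \mathbf{h}) \, p(\mathbf{h} \mid \tilde{\mathbf{v}}) = \mathbb{E}_{\mathbf{h} \mid \tilde{\mathbf{v}}} \left[ p(\mathbf{v} \mid \mathbf{h}) \right], \tag{5}$$

where $\tilde{\mathbf{v}}$ is a noisy input patch. In other words, we find a mean of the conditional distribution of the
visible units with respect to the posterior distribution over the hidden units given the visible units
fixed to the corrupted input image patch.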

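However, since taking the expectation over the posterior distribution is usually neither tractable nor
exactly computable, it is often easier to approximate the quantity. We approximate the marginal
conditional distribution (5) of $\mathbf{v}$ given $\tilde{\mathbf{v}}$ with

$$p(\mathbf{v} \mid \tilde{\mathbf{v}}) \approx p(\mathbf{v} \mid \mathbf{h}) \, Q(\mathbf{h}) = p(\mathbf{v} \mid \mathbf{h} = \boldsymbol{\mu}),$$

where $\boldsymbol{\mu} = \mathbb{E}_{Q(\mathbf{h})}[\mathbf{h}]$ and $Q(\mathbf{h})$ is a fully factorial distribution that approximates the posterior
$p(\mathbf{h} \mid \tilde{\mathbf{v}})$. It is usual to use the mean of $\mathbf{v}$ under $p(\mathbf{v} \mid \tilde{\mathbf{v}})$ as the denoised, reconstructed patch.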

Following this approach, given a noisy image patch $\tilde{\mathbf{v}}$ a GRBM reconstructs a noise-free patch by

$$\hat{v}_i = \sum_{j=1}^{N_h} w_{ij} \, \mathbb{E}[h_j \mid \tilde{\mathbf{v}}] + b_i,$$

where $b_i$ is a bias to the i-th visible unit. The conditional distribution over the hidden units can be
computed exactly from Eq. (2).
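
The GRBM reconstruction amounts to one upward pass followed by one downward pass. The sketch below assumes the hidden posterior has the sigmoid form with inputs scaled by the shared variance; the function name and the random parameters are hypothetical.

```python
import numpy as np

def grbm_denoise_patch(v_noisy, W, b, c, sigma2):
    """Reconstruct a patch as the conditional mean of the visible units.

    v_noisy: (N_v,), W: (N_v, N_h), b: (N_v,), c: (N_h,), sigma2: scalar.
    """
    # Exact posterior mean E[h | v_noisy] of the binary hidden units.
    mu = 1.0 / (1.0 + np.exp(-((v_noisy / sigma2) @ W + c)))
    # Mean of the Gaussian conditional p(v | h = mu).
    return W @ mu + b

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((64, 128))
b = rng.standard_normal(64)
c = np.zeros(128)
v_hat = grbm_denoise_patch(rng.standard_normal(64), W, b, c, sigma2=1.0)
```

A function of this shape can serve directly as the patch denoiser inside a patch-averaging loop over a whole image.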

Unlike a GRBM, the posterior distribution of the hidden units of a GDBM is neither tractably
computable nor has an analytical form. Salakhutdinov and Hinton [2009] proposed to utilize
a variational approximation [Neal and Hinton, 1999] with a fully-factored distribution
$Q(\mathbf{h}) = \prod_{l=1}^{L} \prod_{j} \mu_j^{(l)}$, where the variational parameters $\mu_j^{(l)}$'s can be found by the following simple
fixed-point update rule:

$$\mu_j^{(l)} \leftarrow f\left( \sum_{i=1}^{N_{l-1}} \mu_i^{(l-1)} u_{ij}^{(l-1)} + \sum_{k=1}^{N_{l+1}} \mu_k^{(l+1)} u_{jk}^{(l)} + c_j^{(l)} \right), \tag{6}$$

where $f(x) = \frac{1}{1 + \exp\{-x\}}$. For l = 1 the first sum is taken over the visible units, with
$\mu_i^{(0)} = \tilde{v}_i / \sigma^2$ and $u_{ij}^{(0)} = w_{ij}$, and for l = L the second sum is absent.

Once the variational parameters have converged, a GDBM reconstructs a noise-free patch by

$$\hat{v}_i = \sum_{j=1}^{N_1} w_{ij} \mu_j^{(1)} + b_i.$$

The convergence of the variational parameters can take too much time and may not be suitable in
practice. Hence, in the experiments, we initialize the variational parameters by feed-forward
propagation using the doubled weights [Salakhutdinov and Hinton, 2009] and perform the fixed-point
update in Eq. (6) for at most five iterations only. This turned out to be a good compromise
that sacrifices little performance while reducing the computational complexity significantly.
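
The truncated mean-field procedure can be sketched as below. Several details are assumptions of this example rather than statements about the experiments: the function and argument names, the sequential bottom-to-top update order, and the convention of doubling the bottom-up weights for every layer except the top one during the feed-forward initialization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gdbm_denoise_patch(v_noisy, W, U, b, c, sigma2, n_iter=5):
    """Denoise a patch with a GDBM using a few mean-field iterations.

    W: (N_v, N_1) visible-to-hidden weights,
    U: list of (N_l, N_{l+1}) hidden-to-hidden weights (empty for a GRBM),
    b: (N_v,) visible biases, c: list of L hidden bias vectors.
    """
    weights = [W] + list(U)
    L = len(weights)
    bottom = v_noisy / sigma2
    # Feed-forward initialization; doubling compensates for the missing
    # top-down input (assumed here for all layers except the top one).
    mu, inp = [], bottom
    for l in range(L):
        scale = 2.0 if l < L - 1 else 1.0
        mu.append(sigmoid(scale * (inp @ weights[l]) + c[l]))
        inp = mu[-1]
    # Fixed-point updates: each layer combines bottom-up and top-down input.
    for _ in range(n_iter):
        for l in range(L):
            below = bottom if l == 0 else mu[l - 1]
            total = below @ weights[l] + c[l]
            if l < L - 1:
                total = total + weights[l + 1] @ mu[l + 1]
            mu[l] = sigmoid(total)
    # Reconstruct from the first hidden layer only.
    return W @ mu[0] + b

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((64, 32))
U = [0.1 * rng.standard_normal((32, 16))]
b = np.zeros(64)
c = [np.zeros(32), np.zeros(16)]
v = rng.standard_normal(64)
v_hat = gdbm_denoise_patch(v, W, U, b, c, sigma2=1.0)
```

With an empty list `U` the loop reduces to the exact GRBM reconstruction, which gives a simple sanity check for an implementation of this kind.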

3.2 Denoising autoencoders 

An encoder part of a DAE can be considered as performing an approximate inference of a
fully-factorial posterior distribution of top-layer hidden units, i.e. a bottleneck, given an input image
patch [Vincent et al., 2010]. Hence, a similar approach to the one taken by BMs can be used for
DAEs.

Firstly, the variational parameters $\boldsymbol{\mu}^{(L)}$ of the fully-factorial posterior distribution
$Q(\mathbf{h}^{(L)}) = \prod_j \mu_j^{(L)}$ are computed by

$$\boldsymbol{\mu}^{(L)} = f^{(L-1)} \circ \cdots \circ f^{(1)} (\tilde{\mathbf{v}}).$$

Then, the denoised image patch can be reconstructed by the decoder part of the DAE. This can be
done simply by propagating the variational parameters through the decoding nonlinearity functions
$g^{(l)}$ such that

$$\hat{\mathbf{v}} = g^{(1)} \circ \cdots \circ g^{(L-1)} \left( \boldsymbol{\mu}^{(L)} \right).$$

Recently, Burger et al. [2012] and Xie et al. [2012] tried a deep DAE in this manner to perform
image denoising. Both of them reported that the denoising performance achieved by DAEs is
comparable, or sometimes favorable, to other conventional image denoising methods such as BM3D
[Dabov et al., 2007], K-SVD [Elad and Aharon, 2006] and Bayes Least Squares-Gaussian Scale
Mixture [Portilla et al., 2003].
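
The encoder and decoder passes of a DAE with tied weights can be sketched as follows. Biases are omitted, as in the notation of Section 2.2. The function name, the sigmoid default for the nonlinearity and the layer sizes are assumptions of this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dae_denoise_patch(v_noisy, weights, phi=sigmoid):
    """Encode a noisy patch up to the bottleneck and decode it again.

    weights: list of tied weight matrices; the l-th matrix has shape
    (size of layer l, size of layer l+1). Biases are omitted.
    """
    h = v_noisy
    for W in weights:               # encoder: h <- phi(W^T h)
        h = phi(W.T @ h)
    for W in reversed(weights):     # decoder: h <- phi(W h), same weights
        h = phi(W @ h)
    return h

rng = np.random.default_rng(0)
weights = [0.1 * rng.standard_normal((64, 32)),
           0.1 * rng.standard_normal((32, 16))]
v_hat = dae_denoise_patch(rng.standard_normal(64), weights)
```

Unlike the mean-field procedure of a GDBM, this is a single deterministic pass, so the cost of denoising a patch is fixed by the network depth.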

4 Experiments 

In the experiments, we aim to empirically compare the two dominant approaches of deep learning, 
namely Boltzmann machines and denoising autoencoders, in image denoising tasks. 

There are several questions that are of interest to us: 

1. Does a model with more hidden layers perform better?

2. How well does a deep model generalize? 

3. Which family of deep neural networks is more suitable, Boltzmann machines or denoising 
autoencoders? 

In order to answer those questions, we vary the depth of the models (the number of hidden layers),
the level of noise injection, the type of noise (either white Gaussian additive noise or salt-and-pepper
noise), and the size of image patches. Also, as our interest lies in the generalization capability of the
models, we use a completely separate data set for training the models and apply the trained models
to three distinct sets of images that have very different properties.

Figure 1: Sample images from the test image sets: (a) Textures, (b) Aerials, (c) Miscellaneous.

4.1 Datasets 

We used three sets of images, textures, aerials and miscellaneous, from the USC-SIPI Image
Database¹ as test images. Tab. 1 lists the details of the image sets, and Fig. 1 presents six
sample images from the test sets.

These datasets are, in terms of contents and properties of images, very different from each other. For 
instance, most of the images in the texture set have highly repetitive patterns that are not present in 
the images in the other two sets. Most images in the aerials set have both coarse and fine structures, 
for example, a lake and a nearby road, at the same time in a single image. Also, the sizes of the 
images vary quite a lot across the test sets and across the images in each set. 



| Set           | # of all images | # of color images | Min. Size | Max. Size   |
| Textures      | 64              |                   | 512 x 512 | 1024 x 1024 |
| Aerials       | 38              | 37                | 512 x 512 | 2250 x 2250 |
| Miscellaneous | 44              | 16                | 256 x 256 | 1024 x 1024 |

Table 1: Descriptions of the test image sets.



As we are aiming to evaluate the performance of denoising a very general image, we used a large
separate data set of natural image patches to train the models. We extracted a set of 100,000 random
image patches of sizes 4 x 4, 8 x 8 and 16 x 16 from the CIFAR-10 dataset [Krizhevsky, 2009]. From each
image of the 50,000 training samples of the CIFAR-10 dataset, two patches from randomly selected
locations have been collected.

We tried denoising only grayscale images. When an image was in an RGB format, we averaged the 
three channels to make the image grayscale. 
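
The construction of the training set and the grayscale conversion can be sketched as below. The function names are hypothetical, and the random arrays merely stand in for real images; loading CIFAR-10 itself is not shown.

```python
import numpy as np

def to_grayscale(img):
    """Average the color channels of an (H, W, 3) image; pass 2-D images through."""
    return img.mean(axis=2) if img.ndim == 3 else img

def sample_patches(images, patch, per_image, rng):
    """Collect flattened patches from random locations of each image."""
    out = []
    for img in images:
        g = to_grayscale(np.asarray(img, dtype=float))
        H, W = g.shape
        for _ in range(per_image):
            i = rng.integers(0, H - patch + 1)   # upper bound is exclusive
            j = rng.integers(0, W - patch + 1)
            out.append(g[i:i + patch, j:j + patch].reshape(-1))
    return np.stack(out)

# Stand-in data: five random 32 x 32 RGB images, two patches per image.
rng = np.random.default_rng(0)
images = [rng.random((32, 32, 3)) for _ in range(5)]
X = sample_patches(images, patch=8, per_image=2, rng=rng)
```

Applied to 50,000 images with two patches each, a routine of this kind yields the 100,000 training patches mentioned above.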



4.2 Denoising Settings 

We tried three different depth settings for both Boltzmann machines and denoising autoencoders: a
single hidden layer, two hidden layers and four hidden layers. All hidden layers were set
to have the same number of hidden units, which was a constant factor multiplied by the number of
pixels in an image patch².

We denote Boltzmann machines with one, two and four hidden layers by GRBM, GDBM(2) and
GDBM(4), respectively. Denoising autoencoders are denoted by DAE, DAE(2) and DAE(4),
respectively. Each model structure was trained on image patches of sizes 4 x 4, 8 x 8
and 16 x 16.

The GRBMs were trained using the enhanced gradient [Cho et al., 2011a] and persistent contrastive
divergence (PCD) [Tieleman, 2008]. The GDBMs were trained by PCD after initializing the
parameters with a two-stage pretraining algorithm [Cho et al., 2012]. DAEs were trained by a stochastic
backpropagation algorithm, and when there was more than one hidden layer, we pretrained each
layer as a single-layer DAE with the sparsity target set to 0.1 [Vincent et al., 2010, Lee et al., 2008].

The details on training procedures are described in Appendix A.

One important difference to the recent work by Xie et al. [2012] and Burger et al. [2012] is that
the denoising task we consider in this paper is completely blind. No prior knowledge about target
images and the type or level of noise was assumed when training the deep neural networks. In
other words, no separate training was done for different types or levels of noise injected to the test
images. Unlike this, Xie et al. [2012], for instance, trained a DAE specifically for each noise level
by changing $\eta(\cdot)$ accordingly. Furthermore, the Boltzmann machines that we propose here for image
denoising do not require any prior knowledge about the level or type of noise.

¹ http://sipi.usc.edu/database/
² We used 5 as suggested by Xie et al. [2012].

Figure 2: PSNR of grayscale images corrupted by different types and levels of noise. The median
PSNR over the images in each set is shown. (Columns: white noise, salt-and-pepper noise. Rows:
aerials, textures, misc. Horizontal axis: noise level.)

Two types of noise have been tested; white Gaussian and salt-and-pepper. White Gaussian noise 
simply adds zero-mean normal random value with a predefined variance to each image pixel, while 
salt-and-pepper noise sets a randomly chosen subset of pixels to either black or white. Furthermore, 
three different noise levels (0.1, 0.2 and 0.4) were tested. In the case of white Gaussian noise, they 
were used as standard deviations, and in the case of salt-and-pepper noise, they were used as a noise 
probability. 
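
The two noise types can be sketched as follows. The function names are hypothetical, and two details are assumptions of this example: pixel intensities are taken to lie in [0, 1], and a corrupted pixel becomes black or white with equal probability.

```python
import numpy as np

def add_white_gaussian(img, std, rng):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    return img + std * rng.standard_normal(img.shape)

def add_salt_and_pepper(img, prob, rng):
    """Set each pixel to black (0) or white (1) with total probability prob."""
    out = img.copy()
    hit = rng.random(img.shape) < prob    # pixels to corrupt
    salt = rng.random(img.shape) < 0.5    # white versus black, equally likely
    out[hit & salt] = 1.0
    out[hit & ~salt] = 0.0
    return out

rng = np.random.default_rng(0)
img = rng.random((32, 32))
noisy_g = add_white_gaussian(img, std=0.4, rng=rng)
noisy_sp = add_salt_and_pepper(img, prob=0.4, rng=rng)
```

Note that the same numeric level means different things for the two noise types: a standard deviation in the Gaussian case and a corruption probability in the salt-and-pepper case.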

After noise was injected, each image was preprocessed by pixel-wise adaptive Wiener filtering [see,
e.g., Sonka et al., 2007], following the approach of Hyvärinen et al. [1999]. The width and height of
the pixel neighborhood were chosen to be small enough (3 x 3) so that it will not remove too much
detail from the input image.

Denoising performance was measured mainly with peak signal-to-noise ratio (PSNR) computed by
$-10 \log_{10}(\epsilon^2)$, where $\epsilon^2$ is the mean squared error between an original clean image and the denoised
one.
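
The PSNR computation is short enough to state directly. The sketch below assumes pixel intensities in [0, 1], so that the peak value is 1 and the formula reduces to the negative log of the mean squared error; the function name is hypothetical.

```python
import numpy as np

def psnr(clean, denoised):
    """PSNR in dB for images with intensities in [0, 1] (peak value 1)."""
    diff = np.asarray(clean, dtype=float) - np.asarray(denoised, dtype=float)
    mse = np.mean(diff ** 2)
    # Identical images give mse = 0 and therefore an infinite PSNR.
    return -10.0 * np.log10(mse)

# A constant error of 0.1 gives an MSE of 0.01.
value = psnr(np.zeros((8, 8)), np.full((8, 8), 0.1))
```

For example, a uniform error of 0.1 corresponds to a mean squared error of 0.01 and hence a PSNR of 20 dB.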

4.3 Results and Analysis 

In Fig. 2, the performances of all the tested models trained on 8 x 8 image patches are presented³.

The most obvious observation is that the deep neural networks, including both DAEs and BMs,
did not show improvement over their shallow counterparts in the low-noise regime (0.1). However,
the deeper models significantly outperformed their corresponding shallow models as the level of
injected noise grew. In other words, the power of the deep models became more evident as the
injected level of noise grew.

³ Those trained on patches of different sizes showed a similar trend, and they are omitted in this paper.

This is supported further by Tab. 2, which shows the performance of the models in the high noise
regime (0.4). In all cases, the deeper models, such as DAE(4), GDBM(2) and GDBM(4), were the 
best performing models. 

Another notable phenomenon is that the GDBMs tend to lag behind the DAEs, and even the GRBM, 
in the low noise regime, except for the textures set. A possible explanation for this rather poor 
performance of the GDBMs in the low noise regime is that the approximate inference of the pos- 
terior distribution, used in this experiment, might not have been good enough. For instance, more 
mean-field iterations might have improved the overall performance while dramatically increasing 
the computational time, which would not allow GDBMs to be of any practical value. The GDBMs, 
however, outperformed, or performed comparably to, the other models when the level of injected 
noise was higher. 

It should be noticed that the performance depended on the type of the test images. For instance, 
although the images in the aerials set corrupted with salt-and-pepper noise were best denoised by the 
DAE with four hidden layers, the GDBMs outperformed the DAE(4) in the case of the textures set. 
We emphasize here that the deeper neural networks showed less performance variance depending 
on the type of test images, which suggests better generalization capability of the deeper neural 
networks. 

Visual inspection of the denoised images provides some more intuition on the performances of the
deep neural networks. In Fig. 3, the denoised images of a sample image from each test image set are
displayed. It shows that BMs tend to emphasize the detailed structure of the image, while DAEs,
especially ones with more hidden layers, tend to capture the global structure.

Additionally, we tried the same set of experiments using the models trained on a set of 50,000
random image patches extracted from the Berkeley Segmentation Dataset [Martin et al., 2001]. In
this case, 100 patches from randomly chosen locations from each of 500 images were collected to
form the training set. We obtained results similar to those presented in this paper. The results
are presented in Appendix B.



|         | (a) White Gaussian Noise                | (b) Salt-and-Pepper Noise               |
| Method  | Aerials    | Textures   | Misc.              | Aerials    | Textures   | Misc.      |
| Wiener  | 15.7 (0.1) | 15.5 (0.6) | 15.9 (0.6)         | 16.3 (0.5) | 14.9 (1.3) | 15.3 (1.5) |
| DAE     | 16.4 (0.2) | 16.2 (0.9) | 16.6 (0.8)         | 17.1 (0.6) | 15.7 (1.4) | 16.3 (1.7) |
| DAE(2)  | 17.6 (0.2) | 17.1 (1.2) | 17.7 (1.1)         | 18.1 (0.7) | 16.4 (1.7) | 17.3 (2.0) |
| DAE(4)  | 20.8 (0.7) | 18.7 (2.8) | 20.2 (2.0)         | 20.1 (1.1) | 17.2 (2.8) | 19.0 (2.7) |
| GRBM    | 19.2 (0.4) | 18.0 (1.7) | 18.9 (illegible)   | 18.9 (0.9) | 16.6 (2.1) | 17.6 (2.2) |
| GDBM(2) | 22.3 (1.4) | 18.7 (3.2) | 20.1 (2.4)         | 20.3 (1.4) | 16.5 (3.0) | 17.5 (2.6) |
| GDBM(4) | 22.1 (1.1) | 18.7 (3.0) | 20.2 (2.2)         | 20.3 (1.3) | 16.6 (2.9) | 17.6 (2.5) |

Table 2: Performance of the models trained on 4 x 4 image patches when the level of injected noise
was 0.4. Standard deviations are shown inside the parentheses, and the best performing models are
marked bold.

5 Conclusion 

In this paper, we proposed that, in addition to DAEs, Boltzmann machines, GRBMs and GDBMs,
can also be used for denoising images. Furthermore, we tried to find empirical evidence supporting
the use of deep neural networks in image denoising tasks.

Our experiments suggest the following conclusions for the questions raised earlier:

Does a model with more hidden layers perform better?

In the case of DAEs, the experiments clearly show that more hidden layers do improve performance,
especially when the level of noise is high. This does not always apply to BMs, where we found that
the GRBMs outperformed, or performed as well as, the GDBMs in a few cases. Regardless, in the
high noise regime, it was always beneficial to have more hidden layers.

Figure 3: Images, corrupted by salt-and-pepper noise with 0.4 noise probability, denoised by various
deep neural networks trained on 8 x 8 image patches. The number below each denoised image is
the PSNR. (Columns: Original, Noisy, DAE, DAE(4), GRBM, GDBM(4). PSNRs of the denoised
images in column order: (a) Textures: 16.79, 18.96, 18.10, 18.96; (c) Miscellaneous: 17.59, 21.34,
19.42, 18.98.)

How well does a deep model generalize? 

The deep neural networks were trained on a completely separate dataset and were applied to three
test sets with very different image properties. It turned out that the performance depended on each
test set, although only with small differences. Also, the trend of deeper models performing better
could be observed in almost all cases, again especially with a high level of noise. This suggests that
a well-trained deep neural network can perform blind image denoising well, even when no prior
information about the noisy target images is available.

Which family of deep neural networks is more suitable, BMs or DAEs? 

The DAE with four hidden layers turned out to be the best performer, in general, beating GDBMs 
with the same number of hidden layers. However, when the level of noise was high, the Boltzmann 
machines such as GRBM and GDBM(2) were able to outperform the DAEs, which suggests that 
Boltzmann machines are more robust to noise. 

One noticeable observation was that the GRBM outperformed, in many cases, the DAE with two
hidden layers, which had twice as many parameters. This potentially suggests that a better inference
of the approximate posterior distribution over the hidden units might make GDBMs outperform, or
perform comparably to, DAEs with the same number of hidden layers and units. More work will be required
in the future to give a definite answer to this question.

Although it is difficult to make any general conclusion from the experiments, it was evident that 
deep models, regardless of whether they are DAEs or BMs, performed better and were more robust 
to the level of noise than their more shallow counterparts. In the future, it might be appealing 
to investigate the possibility of combining multiple deep neural networks with various depths to 
achieve better denoising performance. 
A Training Procedures: Details

Here, we describe the procedures used for training the deep neural networks in the experiments.

A.1 Data Preprocessing

Prior to training a model, we normalized each pixel of the training set such that, across all the
training samples, the mean and variance of each pixel are 0 and 1.

The original mean and variance were discarded after training. During test, we computed the mean
and variance of all image patches from each test image and used them instead.
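
The pixel-wise normalization can be sketched as below. The function names and the small constant `eps`, which guards against division by zero for constant pixels, are assumptions of this example.

```python
import numpy as np

def fit_normalizer(patches, eps=1e-8):
    """Per-pixel mean and standard deviation over a set of flattened patches."""
    mean = patches.mean(axis=0)
    std = patches.std(axis=0) + eps
    return mean, std

def normalize(patches, mean, std):
    return (patches - mean) / std

def denormalize(patches, mean, std):
    """Map model outputs back to the original intensity scale."""
    return patches * std + mean

rng = np.random.default_rng(0)
patches = 3.0 + 2.0 * rng.standard_normal((1000, 16))
mean, std = fit_normalizer(patches)
Z = normalize(patches, mean, std)
```

At test time the statistics would be recomputed from the patches of each test image, as described above, and the denoised patches mapped back with the same statistics before they are averaged into an image.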

A.2 Denoising Autoencoders 

A single-layer DAE was trained by the stochastic gradient descent for 200 epochs. A minibatch of 
size 128 was used at each update, and a single epoch was equivalent to one cycle over all training 
samples. 

The initial learning rate was set to $\eta_0 = 0.05$ and was decreased over training according to the
following schedule:

$$\eta_t = \frac{\eta_0}{1 + \frac{t}{5000}}.$$
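
A learning-rate schedule of this kind can be written in one line. The sketch assumes the schedule has the form of the initial rate divided by one plus t over 5000, with t counting parameter updates; the function name and the interpretation of t are assumptions.

```python
def annealed_learning_rate(eta0, t, tau=5000.0):
    """Learning rate after t updates: eta0 / (1 + t / tau)."""
    return eta0 / (1.0 + t / tau)

# The rate is halved after tau updates.
rates = [annealed_learning_rate(0.05, t) for t in (0, 5000, 15000)]
```

Under this form, the learning rate stays close to its initial value during the first few thousand updates and then decays roughly in inverse proportion to the number of updates.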

In order to encourage the sparsity of the hidden units, we used the following regularization term

$$\lambda \sum_{n=1}^{N} \sum_{j=1}^{q} \left( \rho - \phi\left( \sum_{i=1}^{p} v_i^{(n)} w_{ij} + c_j \right) \right)^2,$$

where $\lambda = 0.1$, $\rho = 0.1$ and p and q are respectively the numbers of visible and hidden units. $\phi$ is a
sigmoid function.

Before computing the gradient at each update, we added white Gaussian noise of standard deviation
0.1 to all components of an input sample and set a randomly chosen 20% of the input units to zero.

The weights of a deep DAE were first initialized by layer-wise pretraining. During the pretraining,
each layer was trained as if it were a single-layer DAE, following the same procedure described
above, except that no white Gaussian noise was added for the layers other than the first one.

After pretraining, we trained the deep DAE with the stochastic backpropagation algorithm for 200
epochs using minibatches of size 128. The initial learning rate was chosen to be $\eta_0 = 0.01$ and the
learning rate was annealed according to

$$\eta_t = \frac{\eta_0}{1 + \frac{t}{5000}}.$$

For each denoising autoencoder regardless of its depth, we used a tied set of weights for the encoder 
and decoder. 

A.3 Restricted Boltzmann Machines 

We used the modified energy function of a Gaussian-Bernoulli RBM (GRBM) proposed by
Cho et al. [2011a], however, with a single $\sigma^2$ shared across all the visible units. Each GRBM was
trained for 200 epochs, and each update was performed using a minibatch of size 128.

A learning rate was automatically selected by the adaptive learning rate [Cho et al., 2011a] with the
initial learning rate and the upper-bound fixed to 0.001 and 0.001, respectively. After 180 epochs,
we decreased the learning rate according to 

where t denotes the number of updates counted after 180 epochs of training. 







Persistent contrastive divergence (PCD) [Tieleman, 2008] was used, and at each update, a single
Gibbs step was taken for the model samples. Together with PCD, we used the enhanced gradient
[Cho et al., 2013], instead of the standard gradient, at each update.

A.4 Deep Boltzmann Machines 

We used the two-stage pretraining algorithm [Cho et al., 2012] to initialize the parameters of each
DBM. The pretraining algorithm consists of two separate stages.

We utilized the already trained single-layer and two-layer DAEs to compute the activations of the 
hidden units in the even-numbered hidden layers of GDBMs. No separate, further training was done 
for those DAEs in the first stage. 

In the second stage, the model was trained as an RBM using the coupled adaptive simulated
tempering [CAST, Salakhutdinov, 2010] with the base inverse temperature set to 0.9 and 50 intermediate
chains between the base and model distributions. At least 50 updates were required to make a swap
between the slow and fast samples.

The initial learning rate was set to $\eta_0 = 0.01$ and the learning rate was annealed according to

$$\eta_t = \frac{\eta_0}{1 + \frac{t}{5000}}.$$

Again, the modified form of an energy function [Cho et al., 2011b] was used with a shared variance
$\sigma^2$ for all the visible units. However, in this case, we did not use the enhanced gradient.

After pretraining, the GDBMs were further finetuned using the stochastic gradient method together
with the variational approximation [Salakhutdinov and Hinton, 2009]. CAST was again used
with the same hyperparameters. The initial learning rate was set to 0.0005 and the learning rate was
decreased according to the same schedule used during the second stage.

Figure 4: PSNR of grayscale images corrupted by different types and levels of noise. The median
PSNR over the images in each set is shown. The models used for denoising in this case were trained
on the training set constructed from the Berkeley Segmentation Dataset.

B Result Using a Training Set From Berkeley Segmentation Dataset 

Fig. 4 shows the result obtained by the models trained on the training set constructed from the
Berkeley Segmentation Dataset. Although we see some minor differences, the overall trend is observed to
be similar to that from the experiment in the main text (see Fig. 2).

Especially in the high-noise regime, the models with more hidden layers tend to outperform those 
with only one or two hidden layers. This agrees well with what we have observed with the models 
trained on the training set constructed from the CIFAR-10 dataset. 